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Abstract 

Background: As it becomes increasingly possible to obtain DNA sequences of orthologous genes from diverse sets 
of taxa, species trees are frequently being inferred from multilocus data. However, the behavior of many methods for 
performing this inference has remained largely unexplored. Some methods have been proven to be consistent given 
certain evolutionary models, whereas others rely on criteria that, although appropriate for many parameter values, 
have peculiar zones of the parameter space in which they fail to converge on the correct estimate as data sets 
increase in size. 

Results: Here, using North American pines, we empirically evaluate the behavior of 24 strategies for species tree 
inference using three alternative outgroups (72 strategies total). The data consist of 1 20 individuals sampled in eight 
ingroup species from subsection Strobus and three outgroup species from subsection Gerardianae, spanning ~47 
kilobases of sequence at 121 loci. Each "strategy" for inferring species trees consists of three features: a species tree 
construction method, a gene tree inference method, and a choice of outgroup. We use multivariate analysis 
techniques such as principal components analysis and hierarchical clustering to identify tree characteristics that are 
robustly observed across strategies, as well as to identify groups of strategies that produce trees with similar features. 
We find that strategies that construct species trees using only topological information cluster together and that 
strategies that use additional non-topological information (e.g., branch lengths) also cluster together. Strategies that 
utilize more than one individual within a species to infer gene trees tend to produce estimates of species trees that 
contain clades present in trees estimated by other strategies. Strategies that use the minimize-deep-coalescences 
criterion to construct species trees tend to produce species tree estimates that contain clades that are not present in 
trees estimated by the Concatenation, RTC, SMRT, STAR, and STEAC methods, and that in general are more balanced 
than those inferred by these other strategies. 

Conclusions: When constructing a species tree from a multilocus set of sequences, our observations provide a basis 
for interpreting differences in species tree estimates obtained via different approaches that have a two-stage structure 
in common, one step for gene tree estimation and a second step for species tree estimation. The methods explored 
here employ a number of distinct features of the data, and our analysis suggests that recovery of the same results 
from multiple methods that tend to differ in their patterns of inference can be a valuable tool for obtaining reliable 
estimates. 
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Background 

In phylogenetic studies, it has become increasingly com- 
mon to sequence large numbers of individuals at many 
loci (e.g., [1-4]). While these multilocus datasets pro- 
vide the potential to improve the accuracy of phylogeny 
inferences over large sets of taxa, for a variety of rea- 
sons, topologies of trees inferred at different loci might 
not match [5]. One source of this gene tree discordance 
is incomplete lineage sorting, the phenomenon in which 
sets of sampled lineages fail to coalesce in the popula- 
tion in which they are first capable of coalescing [6]. With 
incomplete lineage sorting, several species tree inference 
methods — including Concatenation [7], Democratic Vote 
Consensus [8], Greedy Consensus [9], Majority- Rule Con- 
sensus [9], Matrix Representation with Parsimony [10], 
and Minimize Deep Coalescences [11]— can be misled by 
discordance of gene trees across loci. 

Numerous approaches have been used for estimating 
species tree topologies from multilocus sequence data. 
Consensus methods construct species tree topologies 
from gene trees according to deterministic rules applied 
to the input set of trees [12,13]. These methods take as 
input a set of gene trees produced from individual loci, 
allowing for separate input evolutionary histories at each 
locus. Because genetic lineages in different species some- 
times have relatively few sequence differences, however, 
the information in a locus can be insufficient to accurately 
infer gene trees, thereby allowing incorrect gene trees 
to adversely influence the constructed species tree (e.g., 
[14,15]). Concatenation methods concatenate a set of mul- 
tiple alignments and construct a tree from the concate- 
nated alignment, treating the estimated "super-gene" tree 
as a species tree estimate [6,13,16]. Because concatenation 
combines all loci to form a single locus, and because dif- 
ferent loci can have different evolutionary histories that 
are disregarded in the analysis of the concatenated align- 
ment, the analysis of loci in this way can lead to incorrect 
species tree inferences [7]. Consensus and concatenation 
have in common that they are "two-stage" methods, in 
which species trees are inferred in two steps — inference 
of gene trees and then species tree inference for consen- 
sus, and inference of the super-gene tree followed by a 
conceptually substantial though methodologically trivial 
pronouncement that this tree is the species tree estimate 
for concatenation [6,12,13]. A third class of approaches 
can be labeled "single-stage" methods, in which species 
trees are inferred by simultaneously modeling the evo- 
lution of sequences among all sampled loci to output a 
species tree estimate [17-20]. These single-stage model- 
based methods often have desirable statistical properties, 
but because they typically explore large spaces of possi- 
bilities rather than algorithmically constructing estimated 
trees, they can be computationally intensive and applica- 
ble only to smaller datasets. 



Properties of species tree inference methods can be 
examined using a variety of frameworks, including the- 
ory, simulations, and empirical assessments. Theoretical 
investigations are often concerned with limiting proper- 
ties as the number of loci approaches infinity [9-11,15,21- 
27]. In-depth explorations of inference methods often rely 
on simulation studies, which are commonly used to inves- 
tigate the performance of species tree inference methods 
on simulated multilocus datasets [10,28-32]. These theo- 
retical and simulation-based studies have the advantage 
of knowing the true species tree, but the disadvantage 
that the scenarios they examine lack the complexity of 
empirical data. 

An alternative approach is to evaluate methods on an 
empirical dataset in which the space of parameter val- 
ues is defined by the evolutionary history of a group 
of species. Recent studies have empirically investigated 
the performance of species tree methods from multilo- 
cus datasets in a variety of organisms, including birds 
[3,33-36], insects [37,38], newts [39], plants [40], pri- 
mates [41,42], rice [1], rodents [43], snakes [44], and 
yeast [4,16,36,45,46]. While some of these studies con- 
structed highly-supported species trees, others did not, 
possibly due to high levels of genealogical discordance 
resulting from incomplete lineage sorting, hybridization, 
and ancient rapid radiations. 

In one such study, [47] found that samples from a 
multilocus dataset of North American pines displayed 
widespread genealogical discordance. This pattern of 
incomplete lineage sorting is a common feature of long- 
lived shrubs and trees (e.g., [48-50]), and likely arises from 
factors such as large effective population sizes and long 
generation times [51]. Because gene tree discordance is 
needed for different algorithms to produce different esti- 
mates, high levels of gene tree discordance make North 
American pines an interesting group in which to compare 
species tree inference methods. 

In this study, we take an empirical approach to the eval- 
uation of species tree inference methods by examining 
the performance of 72 strategies for inferring species tree 
topologies using multilocus data from North American 
pines. Each "phylogenetic inference strategy" consists of 
three components: a method of constructing species trees 
from gene trees (e.g., consensus or concatenation), a gene 
tree inference method (e.g., maximum likelihood, maxi- 
mum parsimony, or neighbor-joining), and an outgroup 
species. Our framework thus focuses on two-stage infer- 
ence strategies that can be separated into gene tree infer- 
ence and species tree inference steps, so that the effect 
of the choices of gene tree and species tree estimators 
can be directly evaluated. We examine ~47 kilobases (kb) 
of sequence spanning 121 nuclear loci sequenced in 120 
individuals from eight ingroup species of Pinus subsec- 
tion Strobus (Table 1) and three outgroup species of Pinus 



Table 1 Ranges and morphological characteristics differentiating eight North American species of Pinus subsection Strobus 



Taxa 


Common name 


Range 


Elevation (m) 


Seed cone 
length (cm) 


Seeds 


Dispersal 


Related notes 


P. olbicoulis ] 


Whitebark pine 


Central British Columbia and Alberta south 
to central California; northern Rockies west to 
the Cascades and Sierra Nevadas 


1300 - 3700 




p 

o 


Wingless 


Birds and rodents 


Closed cone morphology where scales are 
opened through animal agency exclusively; 
timberline species. 


P. oyocohuite 2 


Mexican white pine 


Central Mexico to Honduras; sympatric with 
P. chiapensis at lower elevations 


2300 - 3200 


14- 


-40 


Winged 


Wind 


Southernmost species of the North American 
subsection Strobus; among the largest of 
Mexican pines. 


P. chiopensis 2 


Chiapas pine 


Veracruz, Mexico to northwestern Guatemala; 
sympatric with P. oyocohuite at upper eleva- 
tions 


260 - 2300 


8- 


16 


Winged 


Wind 


Formally considered as a disjunct population 
of Pinus strobus var. chiopensis. 


P. flexilis ] 


Limber pine 


Rocky Mountains and Intermountain Ranges 
from Canada south into the central US 


1500 - 3600 


7 - 


15 


Wingless 


Birds and rodents 


Often found at timberline; oldest trees date 
beyond 1600 years. 


P. lombertiono ] 


Sugar pine 


Oregon, California, Nevada, and isolated 
population in northern Baja California 


330 - 3200 


25 - 


-50 


Winged 


Wind 


Largest species and longest seed cone of 
Pinus; unable to hybridize with any other 
North American pine. 


P. monticolo ] 


Western white pine 


Southern British Columbia to south-central 
California; northern Rockies, Cascades, and 
Sierra Nevadas 


0 - 3000 


10- 


-25 


Winged 


Wind 


Found in moist, montane forests while most 
other western species are relegated to drier 
and more exposed sites. 


P. strobiformis ] 


Southwestern white pine 


Northern Mexico extending into central 
Arizona and New Mexico 


1900 - 3500 


15 - 


-25 


Wingless 


Birds and rodents 


Range intergrades with P. oyocohuite to the 
south and P. flexilis to the north, and with 
these two species, forms a well-documented 
complex. 


P. strobus ] 


Eastern white pine 


Southern Canada south to Georgia; 
Newfoundland to western Ontario and 
Minnesota 


0- 1500 


8- 


20 


Winged 


Wind 


Only member of this group to occur in the 
eastern US and Canada; allopatric from all 
other taxa in subsection Strobus. 
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subsection Gerardianae. With our empirical approach, 
unlike in simulation-based and theoretical evaluations, 
the true species tree is not known. It is still possible, 
however, to evaluate properties of species tree estimators 
without knowledge of a true tree, by comparing the fea- 
tures of species trees inferred by different methods. We 
apply techniques from multivariate statistical analysis to 
sets of inferred species trees to compare characteristics 
of species trees estimated by different strategies and to 
identify groups of strategies that behave similarly. 

Methods 

North American white pine dataset 

A total of 120 individuals were sequenced in eight 
ingroup species of North American white pines from 
Pinus subsection Strobus (Pinus albicaulis Engelmann, 
P ayacahuite Ehrenberg ex Schlechtendal, P chiapensis 
(Martinez) Andresen, P flexilis James, P lambertiana 
Douglas, P monticola Douglas ex D. Don, P strobiformis 
Engelmann, and P strobus L.) and three outgroup 
species from Pinus subsection Gerardianae (P bungeana 
Zuccarini ex Endlicher, P gerardiana Wallich ex 
D. Don, P squamata X. W. Li), the identified sister lin- 
eage to Pinus subsection Strobus [47,54]. Sequencing was 
conducted on haploid templates generated from DNA 
extractions of seed megagametophyte tissue; as a single 
haploid sequence was generated for each individual at 
each locus, no phasing was necessary. Gene sequences 
were obtained from 245 putative nuclear loci chosen from 
among ~7,500 loci recently resequenced for loblolly pine 
(Pinus taeda L., http://loblolly.ucdavis.edu/bipod/ftp/) 
using single pass, bidirectional Sanger sequencing of PCR 
products amplified from haploid megagametophyte tissue 
excised from seeds of each species. Further description 
of laboratory protocols appears in [55]. Sequence data 
were pre-processed and organized using PineSAP [56], 
a bioinformatics pipeline that combines Phred [57], 
Phrap [58], and MUSCLE [59,60] to call bases and align 
sequencing reads. Reported nucleotide sequences con- 
sisted only of A, C, G, T, missing, and gap information, 
with no other ambiguity codes used. After pre-processing, 
the data were manually assembled and aligned using 
CodonCode (CodonCode Corporation, Dedham, MA). 
Bases were called using a minimum Phred score [57,61] 
of 25 for aligned bases. All polymorphisms were visually 
validated. All alignments were further aligned to rese- 
quencing data from P taeda (unpublished data) using 
the profile-profile option in MUSCLE [59,60]. These 
alignments are publicly available as part of the Dendrome 
project (http://loblolly.ucdavis.edu/bipod/ftp/). GenBank 
accession numbers for sequences in the study appear in 
Additional file 1: Table SI. 

Of 245 loci sequenced initially, 37 were dropped 
from further consideration due to low overall quality of 



sequence reads. An additional 15 loci were discarded due 
to possible chloroplast or mitochondrial contamination, 
on the basis of BLAST analysis against pine organel- 
lar sequences in GenBank [62]. Two loci were dropped 
due to sequence similarity to retroelement-like proteins, 
leaving 191 high-quality nuclear gene alignments. We 
then eliminated 70 loci for which at least one of the 
11 species contained no data. This filter reduced the 
dataset to 121 loci, covering ~47 kb of aligned sequence 
data. 

Coding regions (i.e., site annotations) could be con- 
fidently identified for 112 of the 121 loci by further 
analysis using TBLASTx against protein-coding genes 
in Arabidopsis, Oryza, Picea, and Populus. For these 
112 loci, the gene for the highest-scoring TBLASTx 
hit, in combination with the expressed sequence tag 
from loblolly pine, was used to identify coding regions. 
Site annotations for each alignment were validated with 
BLASTp analysis of the amino acid sequences derived 
from the inferred coding intervals against the gene that 
was used to derive the site annotations. For the data 
from the 112 annotated loci, ~62% represents exonic 
regions, ~18% represents intronic regions, ~1% is from 
5' UTRs, and -19% is from 3' UTRs. Because 112 of the 
loci could be confidently identified as belonging to coding 
regions, with a substantial fraction of exonic sequence, the 
data likely contain a mixture of non-neutral and neutral 
regions. 

Overview of the analysis 

The procedure for obtaining results for each of the 72 phy- 
logenetic inference strategies (listed in Additional file 2: 
Table SI) appears in Figure 1. For a given strategy, we 
started from a dataset D with L loci. To generate distri- 
butions on the set of clades inferred by a given strategy, 
we used the bootstrap, creating bootstrap replicates by 
randomly choosing with replacement B sets of L loci. As 
many of the loci are coding and the eight pine species 
are closely related, we chose not to bootstrap across sites 
within a locus to ensure that bootstrapped alignments 
would contain reasonable levels of variation. Next, we 
applied a gene tree inference method to each bootstrap 
replicate dataset. Based on the set of inferred gene trees 
in a bootstrap replicate, we then applied a species tree 
construction method to estimate a species tree topol- 
ogy with one of the three outgroup species. For each 
phylogenetic inference strategy, we constructed B = 
1000 independent bootstrap datasets, thereby estimat- 
ing 1000 species tree topologies. From these topologies, 
we created a list of clades, each with a corresponding 
count of its number of appearances in the 1000 boot- 
strap replicates. Clade lists were then analyzed to assess 
differences among the estimates produced by different 
strategies. 
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Choose an outgroup species Ofrom the set of three possible 
outgroups: P. gerardiana, P. bungeana, and P. squamata 



3 



Dataset D 



Dataset D. 



For each gene, construct 
NJ tree and root the tree 
using outgroup O 



Dataset D n 



Dataset D 




Dataset D s0 





For each gene, construct 
ML, MP, or NJ tree and 
root the tree using 
outgroup O 



Create 1 000 bootstrap sets 
of gene trees by randomly 
choosing with replacement 
L gene trees 



Create 1 000 bootstrap 
average p-distance matrices 
by randomly choosing with 
replacement L genes and 
calculating the distance as P a " 



Create 1 000 bootstrap 
concatenated alignments 
by randomly choosing with 

replacement L genes to 
concatenate 



Create 1 000 bootstrap sets 
of gene trees by randomly 
choosing with replacement 
L s0 gene trees 



Estimate the species tree 
using either STAR, STEAC, 

RTC, or MDC for each of 
the 1 000 bootstrap sets of 
gene trees 



Estimate the species tree Estimate the species tree 



using NJ for each of the 
1000 bootstrap distance 
matrices and root each 
tree with outgroup O 



using SMRT with NJ for 

each of the 1 000 
bootstrap distance 
matrices and root each 
tree with outgroup O 



Estimate the species tree Estimate the species tree 



using either ML, MP, or 
NJ for each of the 1 000 
bootstrap concatenated 
alignments and root each 
tree with outgroup O 



using SMRT with either 
ML, MP, or NJ for each 
of the 1 000 bootstrap 
concatenated alignments 
and root each tree with 
outgroup O 



Estimate the species tree 
using either STAR, STEAC, 

RTC, or MDC for each of 
the 1 000 bootstrap sets of 
gene trees 



3x4 + 3 + 3 + 3x3 + 3x3 + 3x3x4 = 72 phylogenetic inference strategies 

Figure 1 Flow diagram representing the procedure in which we obtained results on the behavior of phylogenetic inference strategies. 

A boldface number attached to a downward arrow indicates the number of phylogenetic inference strategies generated by the box immediately 
above the arrow. Absence of a number indicates a value of 1 .The number of strategies for a particular path from the topmost box to the bottom 
layer is calculated as the product of the boldface numbers visited during the traversal of the path. The number of phylogenetic inference strategies 
analyzed is 72, the sum over all paths from the topmost box to boxes in the bottom layer. 



Creating datasets 

Our final set of 121 loci contains many loci that are 
highly conserved across multiple species. Because of the 
high level of conservation, for these loci, little information 
exists for identifying relationships among lineages. Thus, 
if methods for inferring gene trees were applied to certain 
loci, the resulting gene trees would be highly unresolved 
and would therefore provide little information to species 
tree construction methods. This issue motivates the con- 
struction of datasets that attempt to reduce the chance of 
inferring highly unresolved trees, and that provide phylo- 
genetic inference strategies with the maximal amount of 
sequence data available. 

We therefore analyzed four carefully selected subsets of 
the initial dataset (Table 2; Additional file 3). Two of these 
are datasets of multiple alignments that contain infor- 
mation on a single individual per species (D s and V s $). 
The other two contain information on multiple individ- 
uals per species (D p and V p> o). These four datasets are 
constructed such that each possesses desirable properties 
for certain strategies in the collection of 72 phylogenetic 
inference strategies, providing the strategies with as much 
information as possible to infer resolved phylogenies. For 



example, because it is desirable for a pair of species to 
have nonzero distance, we require pairs of distinct species 
to be separated by at least one observed mutation. Fur- 
thermore, because it is desirable to minimize missing 
data, we choose individuals that yield minimal missing 
data in a multiple alignment. One of the two datasets 
with a single individual sampled per species is optimized 
for locus-by-locus gene tree inference (D s> q), whereas the 
other is optimized for gene tree inference from a concate- 
nated alignment (D s ). Similarly, one of the two datasets 
with multiple individuals sampled per species is optimized 
for locus-by-locus gene tree inference (D P) o), whereas the 
other is optimized for gene tree inference using multiple 
loci simultaneously (D p ). The procedures used for con- 
structing these datasets appear in Sections on "Datasets 
with one individual per species" and "Datasets with mul- 
tiple individuals per species". 

Let Sfr, k = 1,2, . . . , 11, denote the set of individu- 
als from pine species /c, considering eight ingroup species 
(Si,S2t • • • t Ss) and three outgroup species (S9,Sio,Sn). 
Denote the amount of overlapping non-gap non-missing 
sequence between a pair of individuals x and y by n xy and 
denote the number of non-gap non-missing nucleotide 
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Table 2 Datasets 



Dataset Strategies that use the dataset Number of strategies Description 





Concatenation or SMRT with ML, MP, or NJ 


18 


Consists of all 121 loci, with a single 
individual sampled from each of 1 1 
species at each locus. 




STEAC, STAR, RTC, or MDC with ML, MP, or NJ 


36 


Subset of V s requiring that each 
locus has at least one sequence dif- 
ference between each distinct pair 
of species, otherthan pairs from dis- 
tinct outgroups. 


v P 


Concatenation or SMRT with M 


6 


Consists of the full dataset V, which 
contains all individuals and all loci. 




STEAC, STAR, RTC, or MDC with M 


12 


Subset of Dp requiring that each 
locus has at least one sequence dif- 
ference between each distinct pair 
of species, otherthan pairs from dis- 
tinct outgroups. 



differences between a pair of individuals x and y by d xy 
(0 < dxy < n xy ). Further, denote the final dataset of 
L = 121 loci by V = {A\, A2, . • • , Al}> where At is the 
set of aligned sequences at locus i for individuals from all 
11 species. It is from dataset V that we create the four 
optimized datasets as summarized in Table 2 and Figure 2. 

Datasets with one individual per species 

The first dataset, V s , consists of alignments with a sin- 
gle individual sampled per species at each locus (not 
necessarily the same individual across loci). That is, we 
generate a dataset of multiple alignments at each of L 
loci with only one individual per species, thereby creat- 
ing multiple alignments of 11 individuals. This dataset 
is used by phylogenetic inference strategies that utilize 
the concatenation-based species tree construction meth- 
ods with the maximum likelihood, maximum parsimony, 
and neighbor-joining gene tree inference methods (see 
"Inferring gene trees" and "Inferring species trees"). To 
create V s , we choose the subset of 11 sequences A\ at 
locus i by first maximizing the total overlap sequence 
x,yeA s o,x^y n xy an( ^ then, if there is a tie for the 
overlap n(A\), maximizing the total number of substitu- 
tions d(A\) = J2 x ,yeA s £ ,x^y dxy In other words, for any 
other set of aligned sequences Af c At at locus i with 
a set R of only one individual sampled per species, the 
amount of overlapping non-gap non-missing sequence in 
Af is no larger than in A s v i.e., n(Af) < n(A s t ). We note 
that the quantity n xy represents a calculation only on a pair 
of individuals x and y, whereas n(A\) considers all 
pairs of individuals. Further, for any other set of aligned 
sequences Af c At at locus i with a set R of only one 
individual sampled per species and n(Af) = n(A s »), the 
total number of pairwise sequence differences in Af is no 
larger than in A\, i.e., d(Af) < d(A\). If multiple sets R 
of 11 individuals share the same values of n and d, then 
we choose the set of 11 individuals randomly among the 



tied sets. We choose the "optimal" set of 11 individuals at 
each locus in this way both to maximize the sequence con- 
tributions of individual loci to the inference of gene trees 
(maximizing n) and to maximize the potential for creating 
resolved gene trees (maximizing d). 

The second dataset, V s $, is a subset of V s with L S) o < L 
loci that consists of only those loci in V s for which there 
exists at least one nucleotide difference between each dis- 
tinct pair of species (other than pairs of outgroup species). 
In other words, for any pair of individuals x and y with 
x,y e A\ and x 7^ 3/, d xy > 0, and d xy > 0 if ^, y, 
or both are from species 1 through 8. This condition of 
at least one nucleotide difference between species pairs 
assists in constructing gene trees that are bifurcating. 
Dataset V S) o is used by phylogenetic inference strate- 
gies that utilize consensus methods with maximum like- 
lihood, maximum parsimony, and neighbor-joining (see 
"Inferring gene trees" and "Inferring species trees" for 
details). 

Datasets with multiple individuals per species 

The third dataset, V p , is identical to our starting dataset 
V. Thus, strategies that use V p consider all available 
sequences. Dataset V p is used by phylogenetic inference 
strategies that employ the concatenation-based species 
tree construction methods with the neighbor-joining gene 
tree inference method using multiple individuals (see 
"Inferring gene trees" and "Inferring species trees"). 

Consider a dataset D of L loci sampled randomly with 
replacement from V p . Define 



ij ~ 



0 



> l =J 

J2AieDJ2x,yeAtdxyl{xeSi,yeSj} . 
Y.AieDY.x^eAi^yhxeS^yeSj] ' l ^ 



(i) 



where the indicator random variable l{ X eSi,yeSj} equals 1 if 
x e Si and y e Sj and 0 otherwise. The distance matrix 
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A 



1 23456789 1011 12131415 



1 23456789 1011 12131415 L 
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D s,0 
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Figure 2 Schematic for creating the four subsets X> s , X> s ,o/ T2p, and X> p , 0 from dataset X>. For the matrices of datasets V, V Sl V Si0l V p , and 
V Pi o (see Table 2), each row is an individual and each column is a locus. Thick black lines in these matrices separate the individuals in different 
species. Gray boxes indicate missing sequences. (A) At each locus, a single sequence from each species (indicated in red) is selected from dataset V. 
These selected sequences are used to create V s such that there exists a single sequence sampled per species at each locus. Sequences from a 
subset of loci in V s (indicated in yellow) are used to create dataset V s $ such that each locus has at least one nucleotide difference between each 
distinct pair of species other than pairs from distinct outgroups. (B) Dataset V p is the full starting dataset V. At each locus I, a distance matrix is 
created according to eq. 2. Sequences from a subset of loci (indicated in red) in V p are used to create dataset V Pi0 such that each locus has a 
nonzero p-distance between each distinct pair of species other than pairs from distinct outgroups. Observe that the V Pi o matrix includes loci 3 and 
7, which are not included in the V Si0 matrix. Loci 3 and 7 are included in V Pi0 but not in V Si0 because in V Pi q, pairs of species contain at least one 
pair of individuals with different sequences, whereas in X> s ,o# at least one pair of the 1 1 selected individuals have identical sequences. Therefore, the 
set of loci in V Pi q is a superset of the set of loci in V 5l o, and the number of loci in V Pi q is always greater than or equal to the number of loci in V s p. 



defined by eq. 1 is used to estimate gene trees for all strate- 
gies applied to V p . Given distinct species Si and Sp 
represents the ^-distance (fraction of nucleotide differ- 
ences; [63]) averaged over pairs of individuals, one from 
species i and the other from species /. Note that eq. 1 rep- 
resents a weighted rather than unweighted average for the 



mean ^-distance between species i and Although the 
distance is weighted, it is the same as a distance between 
pairs of species calculated on a concatenated alignment. 

The fourth dataset, V Pf o, is a subset of V p with L p> o < L 
loci. This subset consists of only those loci in V p for which 
there exists a pair of individuals in each distinct pair of 



DeGiorgio et al. BMC Evolutionary Biology 201 4, 14:67 
http://www.biomedcentral.eom/1 471 -21 48/1 4/67 



Page 8 of 21 



species (other than pairs from distinct outgroups) with at 
least one nucleotide difference between them. Define 



0 , i = j 

T.x,y^A t n xyl{xeSi,yeSj} ' 1 ^ ^ 



(2) 



where I^gS^gS,} is an indicator random variable that 
equals 1 if x e Si and y e Sj and 0 otherwise. The numer- 
ator of P| represents the number of pairwise sequence 
differences, summed over pairs of individuals, one from 
species Si and the other from species Sj, at locus L The 
denominator represents the sum across pairs of individ- 
uals, one from 5/ and the other from Sj, of the non-gap 
non-missing sequence shared between pairs of individu- 
als at locus I. To construct V p $, we create a subset of 
V p that consists only of those loci in V p for which the p- 
distance (P| > 0) is nonzero between each distinct pair 
of species (excluding pairs from distinct outgroups). This 
dataset is utilized by phylogenetic inference strategies 
that employ consensus methods with gene trees inferred 
by neighbor- joining using multiple individuals (see 
"Inferring gene trees" and "Inferring species trees"). Simi- 
larly to dataset V Sf o, this condition of a nonzero ^-distance 
between species pairs assists in constructing bifurcat- 
ing gene trees. We note that the species tree estimation 
approach taken in this study neither requires pairs of indi- 
viduals in the same species to have nonzero distances 
nor to have distances of zero. We only enforce that the 
distance calculated between pairs of species is nonzero. 

Inferring gene trees 

For each of the four datasets V s , V S) o, V p , and V p> o, we 
inferred gene trees from bootstrap samples [63-65] that 
contain loci randomly sampled with replacement from the 
dataset. For strategies applied to datasets V s and V s $, we 
inferred gene trees from sequence alignments by applying 
either maximum likelihood (ML; [63], ch. 9) under a gen- 
eral time-reversible substitution model ([63], ch. 13), max- 
imum parsimony (MP; [63], ch. 1), or neighbor-joining 
(NJ; [63], ch. 11) to a ^-distance matrix calculated between 
pairs of alignments. For strategies applied to V p and T> Pj o, 
we inferred gene trees by applying neighbor-joining to 
the V al1 and V 1 ^-distance matrices, respectively. We term 
the method for inferring gene trees from the V al1 and 
P^ ^-distance matrices "neighbor-joining using multiple 
individuals" (M). Gene trees were inferred using PAUP* 
[66]. Note that the estimation of gene trees on the scale 
explored in this study would be computationally inten- 
sive on the full set of sampled individuals; thus, we do 
not consider gene tree inference directly from alignments 
with multiple lineages sampled within species, and when 
exploring multiple lineages (as in M), we do so only with 
distance matrices between pairs of species rather than 
pairs of lineages. 



Inferring species trees 

We view as a species tree inference method any method 
that outputs a species tree estimate. The six species tree 
inference methods in this study are Concatenation [16,67], 
SuperMatrix Rooted Triple (SMRT; [22]), STEAC [21], 
STAR [21], Rooted Triple Consensus (RTC; [68]), and 
Minimize Deep Coalescences (MDC; [69,70]). Concate- 
nation and SMRT are concatenation-based, and STEAC, 
STAR, RTC, and MDC are consensus methods. Because 
we have adopted a unified two-stage framework for phy- 
logenetic inference strategies in which gene trees are 
first inferred by one approach and species trees are then 
inferred from gene trees by a second approach, we did 
not investigate single-stage approaches such as BEST 
[18,19], and "BEAST [20] that bypass gene tree inference 
or that perform gene tree inference simultaneously with 
species tree inference. Our analysis pipeline explores the 
performance of two-stage inference strategies when the 
roles of gene tree and species tree inference are sepa- 
rated, and it therefore requires that strategies estimate 
species trees from inferred gene trees and that they per- 
mit different gene tree inference methods to provide 
input to a given species tree method. The six species 
tree methods investigated in this article satisfy both of 
these conditions, whereas species tree methods such as 
BUCKy [17], BEST [18,19], and "BEAST [20] do not. Fur- 
ther, the methods we have selected are well-suited to a 
computationally intensive bootstrap approach included in 
our pipeline for generating distributions of species tree 
topologies, and the more computationally intensive of 
the single-stage methods would not be easily accommo- 
dated within this framework. Given the large number 
of two-stage methods available, it would not be possi- 
ble to be comprehensive; we have thus chosen a limited 
number of methods that represent a range of underly- 
ing principles. Our choice of methods permits a diverse 
set of criteria for estimating species trees to be evalu- 
ated, and the conceptual differences in the underlying 
methods enable some differentiation in behavior across 
methods. 

Consider a set of L loci (multiple alignments) with m 
ingroup and one outgroup species. Concatenation meth- 
ods concatenate the L alignments to create a single "super 
locus" consisting of an alignment of the m + 1 species 
across L loci. From this alignment, a gene tree is inferred 
by either maximum likelihood, maximum parsimony, or 
neighbor-joining— note that the definition of Concatena- 
tion does not require that gene trees be estimated using 
any specific method— and is then taken as the species 
tree estimate. Similarly, SMRT creates a concatenated 
alignment of the m + 1 species from a set of L align- 
ments. However, SMRT then constructs from this con- 
catenated alignment all (^) concatenated alignments of 
three ingroup species and an outgroup species. Rooted 
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three-taxon gene trees are then inferred from each of the 
(^) concatenated alignments. A supertree algorithm is 
then applied to the set of rooted three-taxon gene trees to 
estimate an m-taxon species tree topology. This study uses 
the Modified Mincut supertree algorithm implemented in 
the program SUPERTREE [71] to construct a species tree 
from rooted three-taxon gene trees. 

Consider a set of (m + l)-taxon gene trees (m ingroup 
and one outgroup species) inferred at each of L loci. 
STEAC estimates a species tree topology by using esti- 
mated mean coalescence times. For distinct species Si and 
Sj, the mean coalescence time is computed as the esti- 
mated coalescence time for Si and Sj, averaged over all L 
gene trees. This resulting mean is placed into a matrix of 
distances between species, to which neighbor-joining is 
applied to estimate the species tree topology (using the R 
phybase package). 

STAR estimates a species tree topology by using average 
coalescence ranks. It assumes that the rank of the root of 
a gene tree is equal to the number of species in the tree 
(m + 1 in our case). An internal node of a gene tree is then 
assigned one less than the rank of its immediate ances- 
tor. For distinct species Si and Sj, the average coalescence 
rank is computed as the rank of the node that connects 5/ 
and Sj, averaged over all L gene trees. Similarly to STEAC, 
these average coalescence ranks specify a matrix of dis- 
tances between species pairs. Neighbor-joining is applied 
to the matrix to estimate the species tree topology using 

PHYBASE. 

RTC estimates a species tree from rooted three-taxon 
tree topologies. At each locus i, i = 1, 2, . . . , L, RTC finds 
the set of (^) rooted tree topologies of three ingroup and 
one outgroup species that are displayed by the inferred 
gene tree at locus i. RTC then applies quartet puzzling 
[72] to the (™)L topologies to estimate the species tree 
topology (using the program TRIPLEC). 

A coalescence event between a pair of lineages is con- 
sidered "deep" if the coalescence does not occur in the 
first population in which the pair of lineages is capable of 
coalescing. Given a gene tree, the number of deep coales- 
cences on a species tree is defined as the total number of 
"extra lineages", summed across branches of the species 
tree topology, that is needed to fit the gene tree within the 
species tree topology. Here, the number of extra lineages 
for a branch is one fewer than the number of lineages that 
survive to the ancestral node of the branch; if incomplete 
lineage sorting does not occur, then only one lineage per- 
sists from a branch to a more ancestral branch, and there 
are no extra lineages. For a set of L gene trees, the number 
of deep coalescences for a species tree is the total num- 
ber of deep coalescences for the species tree given a gene 
tree, summed across the L gene trees. MDC estimates a 
species tree topology by minimizing the number of deep 
coalescences. That is, MDC finds a species tree topology 



for which the number of deep coalescences that will fit 
the set of L gene trees within the species tree topology is 
minimal. This study utilizes the MDC implementation in 
PhyloNet [70]. 

Multivariate analysis 

We aim to determine which of the 72 phylogenetic infer- 
ence strategies perform similarly, and we use multivari- 
ate analyses to define clusters of strategies that provide 
similar species tree estimates. Consider a 72 x 145- 
dimensional data matrix S in which rows represent strate- 
gies and columns represent 145 observed clades, among 
the (*) = 246 possible non-trivial clades (i.e., clades 

that contain more than one species and fewer than all ana- 
lyzed species) of eight species. Entry S// in column i and 
row j of S is the number of times that strategy i infers clade 
; in 1000 bootstrap replicates across loci. 

Principal components analysis (PCA) was applied to 
S to create a 72 x 2-dimensional matrix V, with rows 
representing strategies and the first and second columns 
representing the first and second principal components, 
respectively. Plotting strategies onto the space defined by 
these principal components yields a two-dimensional spa- 
tial "map" of phylogenetic inference strategies. We simi- 
larly applied multidimensional scaling (MDS) to a distance 
matrix for all ( ? 2 2 ) pairs of strategies, computing pair- 
wise distance as the mean Robinson-Foulds distance [73] 
across all 10 6 pairs of bootstrap trees, and extracting the 
first two components. We calculated the Robinson-Foulds 
distance using TREEDIST in PHYLIP. 

To compare spatial maps of phylogenetic inference 
strategies, we used Procrustes analysis [74-76]. In partic- 
ular, we compared the spatial distribution of a subset of 
72 — r strategies when analyzed alone to the spatial dis- 
tribution for all strategies. The comparison enabled us 
to quantify the influence that a set of r strategies with a 
particular feature (i.e., species tree construction method, 
gene tree inference method, or outgroup species) has on 
the full spatial distribution. Consider a proper subset £ = 
{^l) &2t ... > &72-r} of the full set of strategies. Consider a 
(72 — r) x 145-dimensional data matrix Ss in which rows 
represent the strategies in set £ and columns represent 
observed clades. Ss is a submatrix of S, in which the rows 
corresponding to strategies in S are selected from S. Con- 
sider a (72 — r) x 2 target matrix X and a (72 — r) x 2 
comparison matrix Y. X is matrix V restricted to the set 
of strategies E. Y represents the first two principal com- 
ponents in the PCA applied to matrix S^. Now consider 
a (72 — r) x 2 matrix Z = bYT + C, where b is a scal- 
ing factor, T is a 2 x 2 matrix that rotates and reflects Y, 
and C is a (72 — r) x 2 matrix that has constant columns 
and that is used to translate the matrix. Procrustes anal- 
ysis seeks to find b, T, and C to minimize the sum of 
squared differences between X and some (72 — r) x 2 
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matrix Z* = bYT + C That is, Z* is formally defined as 
Z* = argmin z {£g-r ^ (x „ _ z .. )2} The dissimi i arit y 

measure between X and Z* is computed as 

Egi r E;U(x,-z^ 
E^'EjLcx, -,*,)'* 

where fij = Y^i=\ r ^ij ls tne A n entry of the centroid 
of X. This measure takes the sum of squared differences 
between points on the spatial maps defined by X and 
Z* and normalizes it by the sum of squared differences 
between the points on the spatial map defined by X and 
their centroid. 

Define a cluster as a set of strategies and let the centroid 
of a cluster be the location in the 145-dimensional space 
of clades whose coordinates are the means of those of all 
strategies in the cluster. Hierarchical clustering was per- 
formed by first creating a matrix of Euclidean distances 
between all ( 7 2 2 ) pairs of 145-dimensional vectors repre- 
sented by the matrix S. Define the within-cluster sum of 
squared Euclidean distance as the squared Euclidean dis- 
tance between a point in a cluster and the cluster centroid, 
summed over all points in the cluster. From the 72 x 
72-dimensional matrix of Euclidean distances between 
strategies, a dendrogram relating the strategies was con- 
structed using the Ward algorithm [77], which iteratively 
merges clusters until all points are contained within a sin- 
gle cluster. For a given iteration, two clusters are merged 
if their merged cluster has a smaller within-cluster sum 
of squared Euclidean distances than any other potential 
merged cluster. The nesting of clusters created by the 
algorithm defines the dendrogram. 

We performed 7<T-means clustering on the 72 145- 
dimensional vectors, using K clusters, K = 2, 3, . . . , 9. 
Given 7<T, strategies were separated into K clusters on the 
basis of the squared Euclidean distance between all pairs 
of the strategies in a 145-dimensional space. We ran 10 4 
replicates with random starting locations. Each replicate 
yielded a total within-cluster sum of squared distances 
for the set of K clusters, representing the within-cluster 
sum of squared distances between points in a cluster and 
the cluster centroid, summed over all K clusters. We then 
chose the set of cluster assignments that had the mini- 
mum total within-cluster sum of squared distances, where 
the minimum was taken over all 10 4 replicate starting 
locations. 

To compute the Pearson correlation coefficient between 
a pair of strategies, we only used points in the 145- 
dimensional vector that were nonzero in both strategies 
being compared. 

Results 

We accounted for the variable outcomes of individual phy- 
logenetic inference strategies by applying the strategies to 



bootstrap datasets instead of their respective full datasets. 
Our analysis identified 145 distinct clades observed in 
the set of 72 phylogenetic inference strategies, among 
246 possible non-trivial clades on eight species, across 
1000 bootstrap replicates for each strategy. From these 
clades, we created a 72 x 145 matrix S in which each 
row is a strategy and each column is a clade. The value 
of Sip the cell in row i and column is the number 
of times among the 1000 bootstrap replicates that strat- 
egy i inferred a species tree with clade This summa- 
rized dataset S of clade counts was used for all further 
analyses. 

Clade size 

We first investigated the level of balance [78-81] in 
the tree topologies inferred by each phylogenetic infer- 
ence strategy. The distribution of clade sizes (number 
of taxa within a clade) provides a basis for measur- 
ing tree topological balance. Topologies with numerous 
small clades tend to be more balanced than topolo- 
gies with large clades. For example, consider the topolo- 
gies T bal = (((AB)(CD))((EF)(GH))) and T unbal = 
(((((((AB)C)D)E)F)G)H). Topology T bai is the most bal- 
anced eight-taxon topology whereas T unba i is the most 
unbalanced eight-taxon topology. Considering non-trivial 
clades, T ba \ has four clades of size two and two clades of 
size four. T unba i has one clade each of size two, three, four, 
five, six, and seven. Thus, the clades of T ba i are smaller 
than those of T unba i. The mean clade size for T ba \ is ~2.67 
and the mean clade size for T unba i is 4.5. 

Figure 3A displays the cumulative distribution of clade 
sizes for each of the 72 phylogenetic inference strate- 
gies, considering all 1000 bootstrap replicate species trees 
for each strategy. This cumulative distribution increases 
most quickly for strategies based on MDC, for which 
most of the distribution is located in clades of size two. 
By contrast, it increases most slowly for strategies based 
on SMRT and STEAC, for which much of the distribu- 
tion is located in clades of size six and seven. Figure 3B 
displays a bar graph of the mean clade size for each 
of the 72 phylogenetic inference strategies. This graph 
shows that among all six species tree construction meth- 
ods, the 12 MDC strategies have the smallest mean clade 
size as well as the smallest variance in mean clade size 
across the 12 combinations of outgroup and gene tree 
inference method. In contrast, SMRT and STEAC in gen- 
eral have the largest mean clade size. However, all 12 
SMRT strategies infer trees with large mean clade size, 
whereas the mean clade size of STEAC varies across 
the 12 combinations of outgroup and gene tree infer- 
ence method. Interestingly, the mean clade size aver- 
aged over all 12 strategies based on MDC is ~2.79, 
a value that is close to the mean clade size for T ba i 
of -2.67. 
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Figure 3 Distribution of clade size for phylogenetic inference 
strategies. (A) Cumulative distribution of clade sizes. Each bar 
represents a strategy, of which there are 1 2 per color. (B) Mean clade 
size for phylogenetic inference strategies, calculated as the mean size 
of all clades inferred across 1000 bootstrap replicates. Vertical lines 
centered at the top of vertical bars represent standard errors of mean 
clade sizes. 



Clustering of strategies 

We next used PCA, MDS, hierarchical clustering, K- 
means clustering, and correlation analysis on the matrix 
of clades S to identify phylogenetic inference strategies 
that perform similarly. Figure 4 displays plots of the first 
two principal components, which account for 38.94% and 
18.96% of the variation across strategies, respectively. 
Figure 4A shows that separate clusters are formed by 



strategies that are based on Concatenation, SMRT, and 
STEAC, and that strategies based on STAR, RTC, and 
MDC form a cluster together. Further, a larger cluster 
is formed by strategies that are based on Concatenation, 
SMRT, and STEAC, and another larger cluster is formed 
by strategies that are based on STAR, RTC, and MDC. 
These larger clusters have a simple interpretation in that 
one of the larger clusters contains topologically-based 
strategies (STAR, RTC, and MDC) and the other contains 
strategies that are not strictly topologically-based (Con- 
catenation, SMRT, and STEAC). Strategies are classified 
as topologically-based if they only use information on tree 
topologies to construct a species tree. In contrast, strate- 
gies are classified as not strictly topologically-based if they 
use information other than the gene tree topologies, such 
as sequence or branch length information, to construct a 
species tree. Relabeling the points in Figure 4A accord- 
ing to gene tree inference method, Figure 4B shows that 
strategies that are based on M (i.e., multiple individuals) 
form a cluster, and that there are no separate clusters for 
strategies that are based on ML, MP, or NJ. Figure 4C, 
which labels points according to outgroup, shows that 
no strategies separate into clusters based on the choice 
of outgroup. When we apply MDS to Robinson-Foulds 
distances between the sets of bootstrap replicate trees 
produced by pairs of strategies (Figure 4), we obtain sim- 
ilar observations of the clusters of strategies, detecting 
an important role for M and for the difference between 
topologically-based and non-topologically-based strate- 
gies, and no strong signal for the outgroup choice. 

From Figure 4, we can see that much of the varia- 
tion across the 72 phylogenetic inference strategies, as 
explained by PCA and MDS, is caused by M. Strategies 
based on M are more similar in clade outcomes to other 
strategies based on M than they are to other strategies 
that are not based on M. The magnitude of this effect can 
be quantified using Procrustes analysis, which demon- 
strates that M has a large influence on the spatial rela- 
tionship among all other phylogenetic inference strategies 
(Additional file 2: Figure SI). 

Figure 5 shows the results of our cluster and corre- 
lation analyses. The main clusters formed by phyloge- 
netic inference strategies involve strategies based on the 
species tree construction methods Concatenation, SMRT, 
STEAC, and MDC or the gene tree inference method M 
(Figure 5). The clusters of strategies formed by 7<T-means 
and the large groupings of strategies formed by hierar- 
chical clustering are quite similar. Additionally, the corre- 
lation coefficient between clade vectors inferred by pairs 
of phylogenetic inference strategies is generally higher for 
pairs of strategies that are placed into the same cluster by 
either 7<T-means or hierarchical clustering than for pairs 
of strategies that are not placed into the same cluster 
(Figure 5). 
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Figure 4 Principal components analysis (PCA) and multidimensional scaling (MDS) of phylogenetic inference strategies. PCA was applied 
to 72 phylogenetic inference strategies, in which each strategy is viewed as a point in a 145-dimensional space of clades. MDS was applied to a 
distance matrix between all pairs of strategies, where the distance between a pair of strategies was the mean Robinson-Foulds distance between all 
pairs of bootstrap trees from one strategy and another strategy {i.e., mean of 1 0 6 comparisons). The plots show the first and second components. 
On a given row, each plot represents the same 72 points in the space of components 1 and 2; the three plots are colored differently to highlight 
different features. (A) Colors represent different methods for constructing species trees. (B) Colors represent different gene tree inference methods. 
(C) Colors represent different outgroups. The points on each graph represent different combinations of the three factors that form phylogenetic 
inference strategies. Each line in part A represents the resultant vector (scaled by a constant to lie within the span of the 72 points) for all 1 2 points 
of a certain method for constructing species trees. Each line in part B represents the resultant vector for all 18 points of a gene tree inference 
method (scaled by a constant). Each line in part C represents the resultant vector for all 24 points of an outgroup (scaled by a constant). The scaling 
constants in parts A, B, and C are distinct. Each of the shaded regions in parts A and B is a convex hull of the points from a particular method. 



Interestingly, the clustering of strategies by PCA and 
MDS in Figure 4 matches well with the groupings 
observed in Figure 5, which is likely driven by similar 
signals. In Figure 5, three large clusters are represented 
by the subtree to the left of the root of the dendro- 
gram (i.e., the blue color in the 7<T-means clustering with 
K = 3) and by two subtrees to the right of the root 
(i.e., the pink and orange colors at K = 3). The two 
subtrees to the right of the root (or pink and orange clus- 
ters defined by 7<T-means clustering) involve strategies that 
are based on M (pink 7<T-means cluster or left subtree 



on the right of the root of the dendrogram) or strate- 
gies that are based on species tree construction methods 
that are topologically-based (orange 7<T-means cluster or 
right subtree on the right of the root of the dendrogram). 
That is, strategies that correspond to the orange cluster 
are based on either STAR, RTC, or MDC. In contrast, 
the subtree to the left of the root (or the blue cluster 
defined by 7<T-means clustering) contains only strategies 
that use species tree construction methods that are not 
strictly topologically-based (i.e., Concatenation, SMRT, or 
STEAC). 
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Figure 5 Cluster and correlation analysis of phylogenetic inference strategies. Each leaf of the dendrogram corresponds to a different 
phylogenetic inference strategy for obtaining the rooted phylogeny of eight ingroup pine species. Blue squares directly below the dendrogram 
indicate the features used to construct a rooted phylogeny for the eight pine species. The first six rows below the dendrogram represent different 
species tree construction methods. The next four rows below the dendrogram represent gene tree inference methods. The following three rows 
below the dendrogram represent the outgroup species. The dendrogram was constructed by hierarchical clustering using the Ward algorithm [77] 
applied to a matrix of Euclidean distances between all ( 7 2 2 ) pairs of 1 45-dimensional vectors (each dimension representing a distinct clade). The 
remaining nine rows below the outgroups show the results of /(-means clustering applied to the 72 1 45-dimensional vectors with /(clusters, 
K = 2,3, ... ,9. Below the cluster analysis is a heat map of the correlation coefficients between all ( 7 2 2 ) pairs of phylogenetic inference strategies. An 
entry in the heat map represents the Pearson correlation coefficient between a pair of strategies by only using points in the 1 45-dimensional vector 
that were nonzero in both strategies being compared. 
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From Figures 4 and 5, we find that phylogenetic 
inference strategies form three basic clusters: a cluster 
that involves strategies that are based on M, a clus- 
ter that involves strategies that are topologically-based, 
and a cluster that involves strategies that are not strictly 
topologically-based. 

Clade flow 

Following the "haplotype flow" computations of [82], we 
can view "clade flow" as a proportion of clades inferred by 
one phylogenetic inference strategy that are also inferred 
by another strategy. Figure 6 displays a heat map that rep- 
resents a form of clade flow, where the cell at row i and 
column ; in the heat map represents the fraction of clades 
inferred by strategy i that were not inferred by strategy 
/. By definition, the heat map is not symmetric. As can 
be seen from the mostly white and yellow boxes for rows 
corresponding to strategies based on M, these strategies 
tend to infer clades that are supported by other strategies. 
That is, if a species tree topology is inferred by a strategy 
that is based on M, then clades displayed by that topol- 
ogy will often also be present on species tree topologies 
inferred by other strategies. In Additional file 2: Figure SI, 
strategies based on M contribute to the most variation 
across strategies. A possible explanation for this obser- 
vation is that the flow of clades is largely unidirectional. 
That is, if a strategy is based on M, then clades that are 
inferred by that strategy also tend to be supported by other 
strategies; however, if a strategy not based on M infers a 
clade, then that clade is not often supported by strategies 
based on M. Because clades inferred by strategies based 
on M also tend to be supported by other strategies, it fol- 
lows that strategies based on M tend to infer clades that 
are also supported by other strategies based on M. This 
sharing of clades among strategies based on M causes 
those strategies to be more similar to each other than 
they are to strategies not based on M. In contrast to the 
results for M, as can be seen from the mostly dark boxes 
in rows for strategies based on MDC, strategies based 
on MDC tend to infer clades that are not supported by 
other strategies (especially when compared with strategies 
based on M). 

Similarly to the behavior of MDC, strategies that are 
based on Concatenation, SMRT, and STEAC together 
with ML, MP, or NJ share more clades with other such 
strategies (mostly white and yellow boxes) than with the 
remaining strategies (mostly dark boxes). In contrast, as 
was observed with M, strategies based on STAR and RTC 
together with ML, MP, or NJ share similar numbers of 
clades among other such strategies as with the remain- 
ing strategies (mostly yellow boxes). These results suggest 
that strategies that are topologically-based [i.e., STAR and 
RTC) tend to infer clades that are also supported both 
by other topologically-based strategies and by strategies 



that are not strictly topologically-based, whereas strate- 
gies that are not strictly topologically-based [i.e., Concate- 
nation, SMRT, and STEAC) tend to infer clades that are 
not supported by strategies that are strictly topologically- 
based {i.e., STAR, RTC, and MDC). 

Representative topologies 

We next wanted to use a set of representative species 
tree topologies to highlight similarities and differences in 
topologies constructed by various strategies. Topologies 
were estimated using the Greedy Consensus algorithm 
[12] applied to clade counts. Because our previous results 
(Figures 4-5) indicate that the choice of outgroup species 
does not strongly influence the overall inferred topolo- 
gies, it is sensible to average across outgroups. Therefore, 
we first present topologies for each of the 24 species 
tree-gene tree inference method pairs constructed from 
clade counts that were averaged over the three outgroups 
(Figure 7). Next, to obtain a clearer picture of the types of 
topologies that are inferred by the six species tree infer- 
ence methods, we present topologies for each of the six 
species tree inference methods, constructed from clade 
counts that were averaged over gene tree inference meth- 
ods and outgroup species (Figure 8). Finally, to assess the 
influence that various gene tree inference methods have 
on the overall inferred species tree topology, we present 
topologies for each of the four gene tree inference meth- 
ods, constructed from clade counts that were averaged 
over species tree inference methods and outgroup species 
(Figure 9). 

Figure 7 displays 24 topologies with clade support val- 
ues for each combination of a species tree construction 
method and a gene tree inference method. The clade 
{P chiapensis, P strobus) is generally highly supported, 
appearing for 22 of 24 strategies, with support ranging 
from 382 to 982 among 1000 bootstrap replicates. The 
smallest support values for {P chiapensis, R strobus} occur 
in strategies that use SMRT with ML, MP, and NJ, pro- 
ducing support values of 382, 406, and 395, respectively. 
The largest support values for this clade occur in strate- 
gies that use M, with values ranging from 824 to 982. 
Further, although strategies based on SMRT with ML, 
MP, and NJ yield lower support values than other strate- 
gies, when SMRT is combined with M, the support for 
{P chiapensis, P strobus] is 905. In addition, although 
two of the strategies based on STEAC do not support 
{P chiapensis, P strobus}, when STEAC is combined with 
M, the support for the clade is 982. Another clade that 
is highly supported is {P ayacahuite, P flexilis, P strobi- 
formis}. This clade is observed across all strategies, with 
support among non-MDC strategies out of 1000 boot- 
strap replicates ranging from 858 to 1000. Strategies that 
use MDC with ML, MP, and NJ yield support values for 
{P ayacahuite, P flexilis, P strobiformis] of 560, 407, and 
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Figure 6 Heat map representing the "flow" of clades between phylogenetic inference strategies. We use clade flow to measure the 
proportion of clades inferred by one strategy that were not inferred by a different strategy. Phylogenetic inference strategies are ordered using the 
dendrogram. The cell at row /' and column j represents the fraction of clades inferred by strategy /'that were not inferred by strategy;'. By definition, 
the heat map is asymmetric. Darker colors indicate lower levels of "flow" from a row to a column. 



415, respectively. However, using MDC with M yields a 
support value of 933 for {P. ayacahuite, P. flexilis, P. strob- 
iformis}. Across the 24 trees, the topological positions of 
P. albicaulis, P. lambertiana, and P monticola are variable 
and are generally poorly supported. Each of these species 
is found in a variety of positions across all trees. 

Figure 8 displays six topologies with clade support 
values for each species tree construction method. Sim- 
ilarly to Figure 7, the clade {P chiapensis, P strobus} is 



generally highly supported across all six species tree con- 
struction methods, with support ranging from 522 to 876 
among 1000 bootstrap replicates. Also, as in Figure 7, the 
clade {P ayacahuite, P flexilis, P strobiformis) is highly 
supported across all six species tree construction meth- 
ods, with support ranging from 579 to 999 among 1000 
bootstrap replicates. From these topologies, we can also 
observe that in agreement with the clade size distribution, 
strategies based on Concatenation, SMRT, and STEAC 
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Figure 7 Rooted consensus trees of phylogenetic inference strategies averaged over outgroups. For a given subset of the 72 phylogenetic 
inference strategies considered, the bootstrap support for each of the clades that appeared in at least one tree was averaged over the set of strategies 
to create a set of counts for each of the clades. Greedy consensus trees [1 2] were then created using the clade counts in the set. Each clade count in 
the set has a maximum value of 1 000, because each element of the set is an average over values that each have a maximum of 1 000. Each consensus 
tree is the greedy consensus tree based on clade counts averaged over outgroup species. These trees disregard branch-length information. 



tend to produce more unbalanced trees than strategies 
based on STAR, RTC, and MDC (Figure 3). Strategies 
based on Concatenation, SMRT, and STEAC support 
topologies in which £ lambertiana is on the opposite 
side of the root from the other seven species. In con- 
trast, strategies based on STAR, RTC, and MDC place 
£ monticola and J! lambertiana as sister species. These 



results support the observations from Figures 4, 5, and 6 
that strategies based on species tree construction meth- 
ods that are topologically-based behave differently from 
strategies that are not strictly topologically-based. 

Figure 9 displays four topologies with clade support val- 
ues, considering each gene tree inference method and 
combining species tree construction methods for each 
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Figure 8 Rooted consensus trees of phylogenetic inference strategies averaged over outgroups and gene tree inference methods. For a 

given subset of the 72 phylogenetic inference strategies considered in this article, the bootstrap support for each of the clades that appeared in at 
least one tree was averaged over the set of strategies to create a set of counts for each of the clades. Greedy consensus trees [12] were then created 
using the clade counts in the set. Each clade count in the set has a maximum value of 1 000, because each element of the set is an average over 
values that each have a maximum of 1 000. These trees disregard branch-length information. (A) Trees constructed using the 1 2 strategies that 
utilize Concatenation; (B) SMRT; (C) STEAC; (D) STAR; (E) RTC; (F) MDC. 



gene tree inference method. As in Figures 7 and 8, 
the clades {P. chiapensis, P strobus} and {P ayacahuite, 
P flexilis, P strobiformis} are generally highly supported 
across all four gene tree inference methods, with sup- 
ports among 1000 bootstrap replicates respectively rang- 
ing from 610 to 931 and from 858 to 988. 

Discussion 

In this article, we have empirically evaluated strate- 
gies for inferring species tree topologies from multilocus 
sequence data. We have found that MDC tends to infer 
balanced topologies, whereas SMRT and STEAC tend to 
infer more unbalanced topologies. This bias toward bal- 
anced topologies exhibited by MDC is a consequence of 
the nature of the criterion that MDC uses to construct 
species trees, reflecting a theoretical finding that species 
trees with more balance have lower deep coalescence costs 
[83]. 



The strategies that we have examined fall into three 
classes in terms of the species tree inferences they pro- 
duce: strategies applied only to datasets including all 
available sequenced individuals {i.e., M), topologically- 
based strategies {i.e., STAR, RTC and MDC), and 
strategies that are not strictly topologically-based {i.e., 
Concatenation, SMRT, and STEAC). While it is not unex- 
pected that some approaches would behave similarly, it 
is surprising that strategies did not cluster based on 
the dataset or approach used (e.g., consensus or con- 
catenation). Instead, strategies that take quite different 
species tree construction approaches (e.g., consensus- 
based STEAC and concatenation-based Concatenation 
and SMRT) form a cluster. Topologically-based strategies 
tend to infer clades that are supported by other strategies, 
whereas strategies that are not strictly topologically-based 
tend to infer clades that are not always well-supported 
by other strategies. For example, clades inferred from 
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Figure 9 Rooted consensus trees of phylogenetic inference strategies averaged over outgroups and species tree construction methods. 

For a given subset of the 72 phylogenetic inference strategies considered in this article, the bootstrap support for each of the clades that appeared 
in at least one tree was averaged over the set of strategies to create a set of counts for each of the clades. Greedy consensus trees [12] were then 
created using the clade counts in the set. Each clade count in the set has a maximum value of 1 000, because each element of the set is an average 
over values that each have a maximum of 1 000. These trees disregard branch-length information. (A) Trees constructed using the 1 8 strategies that 
utilize ML; (B) MP; (C) NJ; (D) M. 



DeGiorgio et al. BMC Evolutionary Biology 201 4, 14:67 
http://www.biomedcentral.eom/1 471 -21 48/1 4/67 



Page 18 of 21 



strategies that are not strictly topologically-based tend 
not to appear on trees that were inferred from strate- 
gies that are topologically-based. A possible reason for 
this observation could be a lack of phylogenetic signal to 
properly infer branch lengths; that is, if a phylogenetic 
inference strategy is not strictly topologically-based, then 
sequences with little phylogenetic signal (e.g., due to low 
substitution rate or short length) can strongly influence 
the species tree inferred by that strategy. Because STEAC 
uses branch-length information to infer a species tree 
topology, sequences with little signal can reduce its per- 
formance relative to topologically-based methods such as 
STAR [15]. 

Although our main goal has been to use North 
American pines to investigate relationships among phylo- 
genetic inference strategies, our results also provide some 
information about the phylogenetic placement of the pine 
species in the study. This analysis is the first multilocus 
study to provide substantial confidence for a sister rela- 
tionship of R chiapensis and R strobus. Pinus chiapensis is 
a threatened species of Mexico and Guatemala whose phy- 
logenetic affinity has been uncertain. Morphological and 
molecular evidence have been used to alternately argue for 
a sister relationship between P. chiapensis and P. strobus, 
from eastern North America, or between J! chiapensis and 
R monticola, from western North America [47]. Here, 22 
of 24 trees in Figure 7 grouped R chiapensis and R strobus 
as sister taxa, mostly with reasonably high bootstrap sup- 
port. When phylogenetic inference strategies were aver- 
aged over either gene tree methods (Figure 8) or species 
tree methods (Figure 9), the {R chiapensis, R strobus} clade 
was always recovered. 

The close phylogenetic affiliation of R ayacahuite, 
R flexilis, and R strobiformis has long been suspected, 
as these three species represent similar forms that are 
continuously distributed from southern British Columbia 
and Alberta into Honduras [84] . Here, the {R ayacahuite, 
R flexilis, R strobiformis} clade is well-supported, although 
relationships among these three species are less clear. Two 
possibilities, namely ((R flexilis, R strobiformis), R ayac- 
ahuite) and ((R ayacahuite, R strobiformis), R flexilis), 
appear more likely based on our analysis (Figures 7, 8 and 
9). Interestingly, Figure 7 finds that the ((R ayacahuite, 
R strobiformis), R flexilis) clade is well-supported by all 
strategies that use M. 

Beyond these clades, the full phylogeny of this group of 
pines remains unclear. Considering the trees inferred in 
Figure 7, relationships among R albicaulis, R lambertiana, 
R monticola, and the clades {R chiapensis, R strobus} and 
{R ayacahuite, R flexilis, R strobiformis} are not stable 
across inference strategies, and bootstrap support is gen- 
erally low. We might have expected greater resolution in 
this study, due to the exhaustive sample of the ingroup, 
extensive intraspecific sampling, large molecular dataset, 



and ease of species delimitation (the eight ingroup species 
include well-defined taxa that are morphologically, eco- 
logically, phenologically, and generally geographically dis- 
tinct). 

We can attribute the lack of resolution in the pine 
phylogeny to several possible sources. First, the loci in 
the study were chosen because they amplify across a 
broad range of taxa from subgenus Strobus (only eight of 
whose members are included here), and might therefore 
be more slowly evolving and less informative for phylo- 
genetic inference than typical loci. Thus, the size of the 
dataset might not be indicative of its information content 
for phylogenetic inference. Second, we have focused on 
strategies that have been implemented for ease of compar- 
ison and have not explored the full collection of available 
methods (e.g., [6,20,85,86]), nor have we considered such 
techniques as investigation of different subsets of taxa or 
loci on the basis of the strategies that we have studied (e.g., 
[16]). A study with a primary goal of resolving the pine 
phylogeny might achieve greater resolution through anal- 
yses that deviate from our standardized procedure. Third, 
the speciation events of interest might have occurred fast 
enough that retention of ancestral polymorphisms, as has 
been observed elsewhere among conifers [47,48,50,87], 
might inhibit convergence on a stable, well-supported 
topology. Further work with more loci or faster-evolving 
loci will be important for distinguishing among these 
alternatives. 

One caveat for interpreting our results is that except in 
our analyses based on M, we only considered a single lin- 
eage sampled within a species. Information on multiple 
lineages of the same species can have a significant effect 
on the performance of species tree inference methods, 
and many methods can use information on coalescences 
within and between species as part of the inference pro- 
cess (e.g., [19-21,23,25-27,88-90]). Therefore, it is impor- 
tant to keep in mind that we have used one of a number 
of potential schemes for sampling individuals within our 
data, as sampling scheme can have an impact on the 
efficacy of species tree estimators [15,20,36,91,92]. 

Another caveat is that some of the datasets were 
obtained from procedures designed to maximize informa- 
tion content at each locus. These optimization procedures 
yielded datasets with one sequence sampled per species. 
Because the sequences within these optimized datasets 
are no longer randomly sampled within each species, a 
possible concern is that our results are not representa- 
tive of random samples. This concern might be warranted 
when considering the inferred relationships of the various 
pine species in Figures 7, 8 and 9. However, as the strate- 
gies applied to each of these optimized datasets retain 
their general relationships across datasets (e.g., those that 
are topologically-based and those that are not strictly 
topologically-based), the conclusions drawn in this article 
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should also hold for large noncoding datasets. Addition- 
ally, it is important to mention that an identical dataset is 
not used for all strategies considered, and notably, strate- 
gies that used M relied on different datasets from those 
that used ML, MP, and NJ. Although this is a limita- 
tion, the clustering pattern suggests it is not a major 
concern. In our principal components (Figure 4), clus- 
ter (Figure 5), and correlation (Figure 5) analyses, though 
the strategies were split by whether they used M or were 
either topologically-based or not strictly topologically- 
based, these three categories do not precisely map onto 
the different datasets. Therefore, though the dataset varies 
across the 72 strategies, other factors beyond the dif- 
ference in datasets are contributing substantially to the 
difference in results. 

Finally, a third caveat is that to maintain a uniform 
procedure across strategies, we did not estimate the 
nucleotide substitution model before applying maximum 
likelihood. This choice might have caused some system- 
atic bias in ML gene tree estimates by overparametrizing 
the substitution model. However, we found that our infer- 
ence strategies did not cluster by whether maximum likeli- 
hood, maximum parsimony, or neighbor-joining was used 
(see Figures 4B and 5), suggesting that any systematic bias 
due to using a general time-reversible substitution model 
did not drive our observed clustering patterns. 

Conclusions 

Based on a collection of two-stage strategies that we 
have investigated, representing a subset of all available 
methods, our analyses have highlighted several aspects of 
phylogenetic inference strategies that enable recommen- 
dations for inferring rooted phylogenies from large-scale 
multilocus data. First, it is beneficial to examine multi- 
ple strategies [93], considering some methods that use 
only topological information (e.g., STAR, RTC, and MDC) 
and others that also incorporate additional information 
(e.g., Concatenation, SMRT, and STEAC). If species tree 
topologies returned by these different classes of species 
tree construction methods agree, then an investigator can 
be more confident in the inferred tree topology. Second, 
estimates should not be based solely on species tree con- 
struction methods that appear to be biased toward certain 
types of topologies (e.g., MDC). Instead, it is preferable to 
utilize these types of methods in conjunction with other 
approaches. For example, after obtaining an unbalanced 
inferred tree from an inference method, if MDC also infers 
the same unbalanced topology, then we can be more confi- 
dent that the true species topology is actually unbalanced. 
Finally, it is best to utilize as much information as is avail- 
able on individuals at every locus. That is, if multiple 
individuals are sampled within a species at a given locus, 
then we should use all available sequence data from the 
species (i.e. as many sampled individuals as possible, as in 



strategies that are based on M). This point is supported by 
the observation that clades inferred by M tend to "flow" to 
other strategies (Figure 6). Based on our findings, we rec- 
ommend the joint consideration of multiple approaches 
to estimating species trees that originate in different loca- 
tions in the space of methods and that exhibit diverse 
properties in their species tree estimates. 
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